Ensem-HAR: An Ensemble Deep Learning Model for Smartphone Sensor-Based Human Activity Recognition for Measurement of Elderly Health Monitoring

Biomedical images contain a huge number of sensor measurements that can provide disease characteristics. Computer-assisted analysis of such parameters aids in the early detection of disease, and as a result aids medical professionals in quickly selecting appropriate medications. Human Activity Recognition, abbreviated as ‘HAR’, is the prediction of common human activities, which consist of movements such as walking, running, drinking, cooking, etc. It is extremely advantageous for services in the sphere of medical care, such as fitness trackers, senior care, and archiving patient information for future use. Two types of data can be fed to a HAR system as input: first, video sequences or images of human activities, and second, time-series data of physical movements during different activities, recorded through sensors such as accelerometers, gyroscopes, etc., that are present in smart gadgets. In this paper, we work with time-series data as the input. We propose an ensemble of four deep learning-based classification models, namely, ‘CNN-net’, ‘CNNLSTM-net’, ‘ConvLSTM-net’, and ‘StackedLSTM-net’, which we term ‘Ensem-HAR’. Each of the classification models used in the ensemble is based on a typical 1D Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network; however, they differ in their architectures. Prediction through the proposed Ensem-HAR is carried out by stacking predictions from each of the four mentioned classification models, then training a Blender or Meta-learner on the stacked predictions, which provides the final prediction on test data. Our proposed model was evaluated on three benchmark datasets, WISDM, PAMAP2, and UCI-HAR; the proposed Ensem-HAR model for biomedical measurement achieved 98.70%, 97.45%, and 95.05% accuracy, respectively, on the mentioned datasets. The experimental results show that the proposed model performs better than the other methods to which it was compared.


Introduction
Human Activity Recognition (HAR) is one of the most active and fascinating study fields in computer vision and human-computer interaction. In the fields of ubiquitous computing, interpersonal interactions, and human behavior analysis, automatically recognizing human physical activities has become a serious challenge. Because of the enormous advancements in microelectronics over the last decade, several complicated and high
Figure 1. Illustrative overview of collecting the raw data from sensors, pre-processing and window segmentation on that raw data, and finally our proposed Ensem-HAR model for predicting human activities with that pre-processed data.

Literature Review
HAR is a challenging research topic in the area of computer vision. Researchers around the world have long been working on developing a near-perfect recognition model, and HAR has been the subject of a great deal of research. This section primarily summarizes previous work on the datasets chosen here.
Deep learning algorithms are very good at processing time-series signals for feature extraction and classification, taking advantage of local dependencies. Researchers have recently been drawn to deep learning methods such as Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and hybrid models for better recognition of human activity [6].
CNN models are highly capable both at classifying images and at processing raw time-series signals. Several researchers have taken advantage of this by applying CNNs to raw inertial data from sensor devices in order to detect human motion.
Chen and Xue [7] proposed a CNN model that recognizes simple motions based on the three-axial signals collected from accelerometers integrated into a smartphone. To process the signals, they built a CNN model and changed the convolutional kernel accordingly. Ronao and Cho [8] developed a CNN design for extracting complicated features that uses wide convolutional kernels (1 × 9 to 1 × 14) with a small pooling size (1 × 2 to 1 × 3). The proposed method was tested using raw data and the temporal properties of Fast Fourier Transformed signals.
The WISDM project was introduced by Kwapisz et al. [9], who chose 29 users to collect data from. On the WISDM dataset, Quispe et al. [10] employed a traditional machine learning classifier, K-Nearest Neighbor (KNN), and achieved state-of-the-art results. Table 1 highlights previous research on the WISDM dataset.
Table 1. Performance of some recent HAR approaches on the WISDM dataset.

Author                Year of Publication   Model/Classifier   Accuracy (in %)
Kwapisz et al. [9]    2010                  MLP                91.7
Zhang et al. [11]     2018                  U-Net              97
Quispe et al. [10]    2018                  KNN                96.2
Pienaar et al. [12]   2020                  RNN-LSTM           94

Ignatov [13] combined statistical features with CNN to obtain information from raw signals. Wan et al. [14] used CNN to extract the local features of sensor data obtained from smartphone signals, reducing the cost of energy consumption. The impact of signal duration on performance was investigated in this study by adjusting the sliding window up to 1 s. Avilés-Cruz et al. [15] established a deep learning method for classifying and analyzing exclusive user-dependent activity recognition using the CNN model. Table 2 highlights previous measurement research on the PAMAP2 dataset.
To minimize the memory and computational expenses of typical CNNs, Tang et al. [16] employed Lego filters instead of convolutional filters. The suggested lightweight model does not need any particular network structure or computing resources, and it improves the efficiency and scalability of the experiments. Similar to [16], Cheng et al. [17] presented a new computationally efficient HAR approach for mobile and wearable devices which uses conditionally parameterized convolution. Experiments were carried out to demonstrate the efficiency of the larger baseline model network.
The CNN models discussed above are capable of attaining high accuracy in the recognition of human activities. An ensemble of CNNs with variable layers and filters to eliminate accuracy variations was presented by Zhu et al. [18]. Their model recognizes easily confused actions and dynamic activities with less training data. Whereas CNN models are well suited to classifying images, LSTM models are very good at predicting raw time-series signal sequences. CNN models use "spatial correlations" to categorize images, whereas an LSTM model classifies time-series data by processing a complete sequence of data through a feedback link. Researchers have presented various strategies for LSTM-based HAR models.
Agarwal et al. [22] suggested a lightweight deep learning approach to constructing a HAR system with fewer computational resources and less delay, allowing the proposed model to be easily employed in real-world applications. Rashid et al. [23] extended the CNN approach [22] and offered a low-power adaptive CNN that is energy- and memory-efficient. Zhao et al. [24] developed a bidirectional residual LSTM architecture with the benefit of combining the forward and backward states, i.e., both directions in time. The residual links between the stacked cells prevent the vanishing gradient problem. Table 3 highlights previous measurement research on the UCI-HAR dataset.
The usual solution for the HAR system is to use mostly local features gathered using heuristic approaches. A deep hybrid approach based on integrating CNN with LSTM has been proposed in a study by Sun et al. [25]. Another important challenge in HAR is handling weakly labelled sensor data with the LSTM model. In order to improve performance with weakly labelled sensor data, Zhou et al. [26] designed a semi-supervised LSTM learning architecture employing a Deep Q-Network.
Hybrid models are thought to be more effective than standard deep learning models in a variety of situations, including adequate training, confusing actions, and device placement [27].
Table 3. Performance measurement of the latest HAR approaches on the UCI-HAR dataset.
Although CNN, LSTM, and hybrid models are important in HAR problems, CNN-based models have several drawbacks; they require a large quantity of data during training and are inefficient in terms of both time and cost. To address these difficulties, Mondal et al. [31] used a Graph Neural Network (GNN), transforming time-series data into a structural graph representation.
He et al. [32] suggested a weakly supervised HAR method that deals with sensor data using 'recurrent attention learning'. Here, the CNN features are retrieved through multiple iterations using a reward-based reinforcement learning method. Because it is harder to label a dataset for a long and complicated series of actions, Zhu et al. [33] used an interim assembly of LSTMs. Li et al. [34] proposed a model based on a residual block and BiLSTM. The residual block is used to extract spatial features from multidimensional signals, and the forward and backward dependencies of the feature sequence are derived using BiLSTM.
Above, we have addressed the use of CNN, LSTM, and hybrid models in the area of HAR. In this paper, we discuss the deep learning-based paradigm and our construction of an ensemble of four deep learning classifiers designed with CNN, LSTM, and hybrid architectures, which can effectively predict human activities across three conventional benchmark time-series datasets.

Materials and Methods
We considered the following four deep learning-based models as our base models:
1. CNN-net (a 1D CNN model with three levels);
2. CNN-LSTM-net (a 1D CNN model with three levels and an LSTM model);
3. ConvLSTM-net (a time-distributed CNN fed to an LSTM before a dense layer); and
4. StackedLSTM-net (a two-layered LSTM).
Using these base models, we formed our ensemble, called Ensem-HAR, which is discussed in the following sections. First, we discuss each of the base models in brief and present their diagrams.

CNN-Net Model
A CNN is a deep cascaded artificial neural network (ANN) that is made up of many layers, each with a number of neurons. Several critical network layers, such as the "Convolutional Layer", the "Pooling Layer", and the "Dense Layer", serve different functions in CNNs. The architectural overview of the proposed convolutional neural network model (CNN-net) is depicted in Figure 2.
The CNN-net model processes the segmented data. It comprises three levels, each consisting of convolutional layers with filters of different kernel sizes in each layer.

• Each CNN level consists of four 1D convolution layers and incorporates the ReLU activation function, which introduces non-linearity. The kernel size differs from layer to layer, while the number of filters is the same for every layer within a level.
• At the very first level, every convolution layer extracts features from the input windows based on differently sized kernels. The features retrieved from each layer in the first level are concatenated, then a max-pooling layer with a pool size of five generates a summary of the extracted features provided by the convolution layers, reducing the computation costs.
• The features extracted from the first MaxPool layer are then fed into the second and third sets of four-layered CNNs consecutively in a similar fashion. The layers in both the second and third levels have different kernel sizes, as in the first level; however, the count of filters used in each layer is the same within the respective levels (64 and 32).
• The output from the third MaxPool layer is flattened and fed to the classification layer. This layer is made up of two fully connected (FC) layers with the SoftMax activation function.
• Dropout is added after the first FC layer for regularization, i.e., to minimize the likelihood of overfitting. All of the models use the Adam optimization method to update the weights so as to minimize the loss.
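As a rough illustration of one CNN level as described in the list above (parallel 1D convolutions with different kernel sizes, ReLU, concatenation, then max-pooling with a pool size of five), the following pure-Python sketch may help. It is not the authors' implementation: the kernel sizes (3, 5, 7, 9), the fixed averaging kernels used in place of learned weights, and the single-channel toy input are all assumptions made to keep the example small.

```python
def conv1d_same(signal, kernel):
    """1D convolution (cross-correlation) with zero padding, so the output keeps the input length."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    padded = [0.0] * pad_left + list(signal) + [0.0] * (k - 1 - pad_left)
    return [sum(padded[i + j] * kernel[j] for j in range(k)) for i in range(len(signal))]


def relu(values):
    """ReLU introduces the non-linearity: negative responses are clipped to zero."""
    return [max(0.0, v) for v in values]


def max_pool(values, pool_size=5):
    """Non-overlapping max-pooling; a trailing remainder shorter than pool_size is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


def cnn_level(signal, kernels, pool_size=5):
    """One level: parallel conv branches -> ReLU -> concatenate (as channels) -> max-pool."""
    feature_maps = [relu(conv1d_same(signal, kernel)) for kernel in kernels]
    return [max_pool(fmap, pool_size) for fmap in feature_maps]


# Hypothetical kernels of sizes 3, 5, 7, and 9 (simple averaging filters, not learned weights).
KERNELS = [[1.0 / size] * size for size in (3, 5, 7, 9)]

window = [float(t % 16) for t in range(128)]  # toy single-axis window of 128 readings
pooled = cnn_level(window, KERNELS)
print(len(pooled), len(pooled[0]))  # number of feature maps, pooled length per map
```

In a real model each branch would hold many learned filters (for example 64 or 32 per layer, as stated above), and the pooled output of one level would be the input of the next.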

CNN-LSTM-Net Model
Long short-term memory networks, or "LSTMs", are a type of recurrent neural network that can learn long-term dependencies. Their default behaviour is to keep information in their memory for longer periods.
This architecture is an extension of the previously described CNN-net model. In this model, we added an LSTM network in parallel with the CNN-net; the features extracted by the two branches were passed through a dense layer and a dropout layer, respectively, then concatenated and fed to the classification layer with the SoftMax activation function. The architectural overview of the proposed CNN-LSTM-net is depicted in Figure 3.
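To make the idea of an LSTM keeping information in its memory more concrete, here is a minimal sketch of a single scalar LSTM step using the standard gate equations. The weights are hypothetical values picked for illustration (a strongly positive forget-gate bias), not trained parameters from this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell using the standard gate equations.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate value to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # updated long-term memory
    h = o * math.tanh(c)              # updated hidden state (the layer's output)
    return h, c


# Hypothetical weights: a strongly positive forget bias keeps the memory almost intact.
WEIGHTS = {
    "forget": (0.0, 0.0, 6.0),
    "input": (1.0, 0.0, -6.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 6.0),
}

h, c = 0.0, 1.0  # start with a value already stored in the cell state
for x in [0.0] * 20:  # 20 time steps with no new input
    h, c = lstm_step(x, h, c, WEIGHTS)
print(round(c, 3))  # the stored value decays only slowly
```

With the forget gate close to one, the cell state changes very little from step to step, which is the mechanism that lets an LSTM carry information across long sequences.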

ConvLSTM-Net Model
In the ConvLSTM architecture, the CNN layers extract features from the input data, while the LSTMs facilitate sequence prediction [35]. The ConvLSTM model receives subsets of the main input as blocks and extracts features from each block, then lets the LSTM analyze those features to obtain the prediction. To put this concept into practice, we divided each window of n time steps into equal-sized sub-sequences for the CNN architecture to process. In this proposed model, we divided each window of 128 time steps into four corresponding sub-sequences of 32 time steps. We then created a CNN architecture that reads the sequences of 32 time steps with n features.

• We wrapped the complete CNN model in a time-distributed layer, allowing it to read in each of the four sub-sequences.
• In the proposed model, we have used three time-distributed layers of the mentioned type, and the output from them was provided to a two-layer stacked LSTM.
• The output obtained from the LSTM layer was forwarded to the classification layer, which was made up of two fully connected (FC) layers with the SoftMax activation function.
• The number of filters and kernel size of the 1D convolutional layers and the number of hidden units present in the LSTM layers were determined by implementing a random search over a range of values for these parameters.
The overall architecture of the proposed ConvLSTM-net is shown in Figure 4.
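The splitting of a window into sub-sequences described above can be sketched as follows. The helper name `split_window` and the three-feature toy window are assumptions for illustration. In a framework such as Keras, the same step would typically be a reshape from (128, n) to (4, 32, n) before a time-distributed layer.

```python
def split_window(window, n_subsequences=4):
    """Split one window (a list of time steps) into equal-sized, consecutive sub-sequences."""
    if len(window) % n_subsequences != 0:
        raise ValueError("window length must be divisible by n_subsequences")
    step = len(window) // n_subsequences
    return [window[i:i + step] for i in range(0, len(window), step)]


# Toy window: 128 time steps, each with 3 features (e.g., x, y, z acceleration).
window = [[float(t), 0.0, 0.0] for t in range(128)]
blocks = split_window(window)
print(len(blocks), len(blocks[0]), len(blocks[0][0]))  # sub-sequences, time steps, features
```

Each of the four blocks is then processed by the same CNN, and the resulting four feature vectors form the sequence that the LSTM reads.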

StackedLSTM-Net Model
Stacked LSTMs have become a well-established approach for solving difficult sequence prediction challenges in deep learning. The Stacked LSTM architecture is made up of multiple LSTM layers placed one after another, each processing the output of the previous one. Rather than providing a single value, an LSTM layer above sends a sequence of values to the LSTM layer below as input. In our proposed model, the Stacked LSTM model comprises two LSTM layers.


• Each LSTM layer consisted of 128 hidden units, with a dropout layer added to reduce overfitting.
• Batch normalization was added after each LSTM layer to standardize the inputs to a layer for each mini-batch.
• The output from the stacked LSTM was fed to the classification layer, comprising two fully connected (FC) layers with the SoftMax activation function.
The overall architecture of the proposed StackedLSTM-net is shown in Figure 5.
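The batch normalization step mentioned in the list above can be sketched in a few lines. This is a simplified training-time version under stated assumptions: the learnable scale and shift parameters and the running statistics used at inference time are omitted, and the mini-batch values are made up.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Standardize each feature over the mini-batch to zero mean and (nearly) unit variance.

    The learnable scale and shift that usually follow are omitted here.
    """
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_features)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(n_features)]
            for row in batch]


# Made-up mini-batch of 4 samples with 2 features on very different scales.
mini_batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = batch_normalize(mini_batch)
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization both features are on the same scale, which is the effect the layer is meant to have on the inputs of the following layer.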
The predictions from these four base models are combined to generate the ensemble. The technique we employed for the ensemble of these base models is called stacking (short for stacked generalization) [36]. It is based on a simple idea: rather than using a basic function (such as hard voting) to combine the predictions of all base models in an ensemble, a machine learning model is trained to do it. Figure 6 demonstrates a prediction job performed by such an ensemble on a new instance. The bottom three predictors each predict a different value (P1, P2, P3), and the final predictor (known as a blender or a meta-learner) uses these predictions as inputs to generate the final prediction (Pf).
Keeping a hold-out set is a standard method for training a blender. The training dataset is initially divided into two subsets. Here, we trained the first-layer predictors (or base models) with the first subset (see Figure 7).
Then, we take predictions on the second (held-out) set using the first-layer predictors. Because the predictors never encountered these instances during training, the predictions here are "clean". These fresh predictions are then stacked to make a new feature set, and the blender is trained using this new feature set and the original target values/labels of the held-out set (see Figure 8).
Once the blender is trained (in this study, we used a Random Forest Classifier), the predictions on the test data are taken sequentially (see Figure 9). This is done by taking predictions from the base models or predictors of the first layer on the test set, then stacking those predictions to make a new feature set; finally, the trained blender provides the final prediction on this stacked feature set.
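The hold-out stacking procedure described above can be sketched in pure Python as follows. Several parts are assumptions made for illustration: the base models are hypothetical threshold functions standing in for already-trained networks, the data are made up, and the blender is a simple nearest-centroid stand-in rather than the Random Forest Classifier used in this study. Any classifier exposing `fit` and `predict` (such as scikit-learn's `RandomForestClassifier`) could take its place.

```python
def stack_predictions(base_models, samples):
    """Concatenate every base model's class-probability vector into one feature row per sample."""
    return [[p for model in base_models for p in model(x)] for x in samples]


class NearestCentroidBlender:
    """Stand-in meta-learner with fit/predict (NOT the Random Forest used in the paper)."""

    def fit(self, features, labels):
        sums, counts = {}, {}
        for row, label in zip(features, labels):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {label: [v / counts[label] for v in acc]
                          for label, acc in sums.items()}
        return self

    def predict(self, features):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids, key=lambda label: sq_dist(row, self.centroids[label]))
                for row in features]


def threshold_model(threshold):
    """Hypothetical 'already trained' base model: maps a scalar to [P(class 0), P(class 1)]."""
    def predict_proba(x):
        p1 = 1.0 if x > threshold else 0.0
        return [1.0 - p1, p1]
    return predict_proba


# Step 1 (not shown): the base models are trained on the first subset of the training data.
base_models = [threshold_model(0.4), threshold_model(0.6)]

# Step 2: predict on the held-out subset and stack the predictions into a new feature set.
held_out_x, held_out_y = [0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]
blender = NearestCentroidBlender().fit(stack_predictions(base_models, held_out_x), held_out_y)

# Step 3: at test time, stack the base predictions again and let the blender decide.
test_x = [0.05, 0.95]
final_predictions = blender.predict(stack_predictions(base_models, test_x))
print(final_predictions)
```

The key point is that the blender never sees the raw sensor windows; its only inputs are the stacked outputs of the first-layer predictors.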
osensors 2022, 12, x FOR PEER REVIEW
From the above discussion of how our proposed Ensem-HAR model works, it can be seen that the proposed Ensem-HAR model is not a stand-alone model; instead, it is a combination (technically called an ensemble) of four deep learning classifiers. Thus, the errors made in prediction by any particular classifier can be neutralized by other classifiers in the final ensemble. The main factor to be considered here is that the classifiers should be distinct enough that they exhibit different characteristics in their predictions, and hence can compensate for each other's errors, which in turn leads to higher final recognition accuracy in predicting human activities.

WISDM Dataset
Kwapisz et al. [9] created the mentioned dataset by capturing different human activities such as 'sitting', 'walking', 'jogging', 'standing', 'walking downstairs', and 'walking upstairs' at a sampling rate of 20 Hz using an accelerometer integrated into the participants' smartphones. In total, 36 participants completed the six activities listed above, and for each activity, acceleration was recorded along three axes (x, y, z), constituting three features. Then, the raw sensor data were segmented into fixed-sized windows with 50% overlap (128 readings per window).

PAMAP2 Dataset
A physical activity tracking dataset was developed by A. Reiss and D. Stricker [37] which includes a variety of activities carried out by nine participants. All of the participants were given eighteen different activities to complete (out of which six activities were optional), including 'rope jumping', 'running', 'soccer', etc. Three sensors placed at different sites on the participants' bodies were used to capture activity-related data. At a sampling rate of 100 Hz (i.e., 100 samples are recorded each second), a total of 52 features were recorded. In this study, twelve out of the eighteen daily activities were used for experimental purposes. In addition, the sensor data were segmented into fixed-sized windows with a 50% overlap (128 readings per window).

UCI-HAR Dataset
Anguita et al. [38] compiled the mentioned dataset. A total of 30 individuals took part in the study, which included daily human activities such as 'sitting', 'lying', 'walking', 'standing', 'walking upstairs', and 'walking downstairs'. Through an accelerometer and a gyroscope installed in the subjects' smartphones, the authors were able to capture linear acceleration and angular velocity along three axes (i.e., x, y, z). At a sampling rate of 50 Hz (i.e., 50 samples are recorded each second), a total of nine features were captured, and the data were segmented into fixed-sized windows with a 50% overlap (128 readings per window). There are a total of 10,299 samples in the mentioned dataset, which is already segregated according to user ID.
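The window segmentation used for all three datasets (fixed-sized windows of 128 readings with 50% overlap) can be sketched as follows. The function name `segment_windows` and the toy recording are assumptions for illustration; how each window is assigned its activity label is not covered here.

```python
def segment_windows(readings, window_size=128, overlap=0.5):
    """Cut a sequence of readings into fixed-size windows with fractional overlap.

    Readings at the end that do not fill a whole window are discarded.
    """
    step = int(window_size * (1.0 - overlap))
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    return [readings[start:start + window_size]
            for start in range(0, len(readings) - window_size + 1, step)]


# Toy recording: 320 tri-axial accelerometer readings (x, y, z).
recording = [(float(t), 0.0, 0.0) for t in range(320)]
windows = segment_windows(recording)
print(len(windows), len(windows[0]))  # number of windows, readings per window
```

With a 50% overlap, consecutive windows start 64 readings apart, so every reading except those near the edges appears in two windows.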

Machine Specification
We trained and tested the base models and the proposed ensemble on a machine equipped with an AMD Ryzen 5 2500U CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1050 GPU. The machine runs a 64-bit Windows 10 operating system. For the development of the proposed model, we used Python (3.9) with the TensorFlow (2.7.0), Keras, and Scikit-learn libraries.

Evaluation Metrics
The dataset was divided into two sets for the evaluation procedure, a training set and a testing set. The model was fitted on the training set, and predictions were then made on the test set.
To train each of the base models used in the ensemble, certain hyperparameters were used; for instance, we used a batch size of 64 and trained for 30 epochs. The training loss was quantified using the categorical cross-entropy, which was minimized using an efficient gradient-descent technique called the Adam optimizer. In the following discussion, we describe the fundamental performance metrics used in this HAR study.
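As a small worked example of the categorical cross-entropy loss named above, the sketch below computes it for two made-up samples over three classes. The probabilities are illustrative and not taken from any experiment, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy between one-hot labels and predicted probabilities."""
    total = 0.0
    for target, probs in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true)


# Two toy samples over three classes (made-up numbers).
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_cross_entropy(y_true, y_pred)
print(round(loss, 4))
```

Only the probability assigned to the true class contributes to the loss, so the value shrinks towards zero as the model becomes more confident in the correct class.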
The performance metrics "Precision", "Recall", "F1-Score", and "Accuracy" were used to evaluate our HAR models. First, we defined accuracy, which was estimated by dividing the number of accurately categorized instances by the total number of samples.
Another important metric is the F1-measure (F1-score), which is a single metric that integrates precision and recall. Therefore, the F1-score is more informative than accuracy as a measure of a model's performance. Additionally, as the classes are weighted according to their sample fraction, the F1-score is considered the best choice in cases of class imbalance. The F1-score is expressed as follows:
\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
Apart from these evaluation metrics, another important measure of performance in classification models is the "Receiver Operating Characteristic" or ROC curve. It is a graphical representation of the true positive rate (TPR) against the false positive rate (FPR) at all classification thresholds. The "Area Under the ROC Curve", or simply AUC, is the two-dimensional area underneath the ROC curve. Its value lies between 0 and 1. A value of AUC close to 1 indicates that the model separates the classes well, while an AUC of 0.5 or less indicates performance no better than random guessing, so the model cannot be relied on for making predictions.
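The metrics above can be computed directly from the true and predicted labels. The following sketch treats one activity as the "positive" class; the function name and the labels are made up for illustration and do not come from the experiments.

```python
def classification_report(y_true, y_pred, positive):
    """Accuracy plus precision, recall, and F1-score for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = ["walking", "walking", "sitting", "jogging", "walking", "sitting"]
y_pred = ["walking", "sitting", "sitting", "jogging", "walking", "walking"]
report = classification_report(y_true, y_pred, positive="walking")
print({name: round(value, 3) for name, value in report.items()})
```

For a multi-class problem such as HAR, the per-class scores would be computed for every activity in turn and then averaged, for example weighted by the number of samples in each class.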

Analysis on WISDM Dataset
The samples of the pre-processed WISDM dataset were split into training (70%) and test (30%) sets, used respectively to train the previously mentioned base models and to test the ensemble of those models. The confusion matrix obtained from the evaluation of the trained proposed model on the test data is shown in Figure 10. According to this confusion matrix, the classification accuracy of our proposed model is around 97% or higher for all six activity classes and the overall accuracy is 98.71%, whereas the base models, viz., 'CNN-net', 'CNN-LSTM-net', 'ConvLSTM-net', and 'StackedLSTM-net', have accuracies of 96.62%, 97.84%, 97.33%, and 98.61%, respectively, as shown in Figure 11.
From the confusion matrix in Figure 10, it can be seen that, except for the 'Downstairs' and 'Upstairs' activities, the other four activities were classified almost perfectly (accuracies are almost 100%) by the proposed ensemble of the base models. Around 3.2% of the samples of the 'Upstairs' activity were misclassified as 'Downstairs', and around 2.4% of the samples of the 'Downstairs' activity were misclassified as 'Upstairs', as they are opposing kinds of activities.
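Per-class misclassification rates such as those quoted above are obtained by normalizing each row of the confusion matrix. The sketch below illustrates this; the counts are made up (chosen only so that the resulting rates match the percentages quoted above) and are not the actual WISDM confusion matrix, and `row_normalize` is a hypothetical helper.

```python
def row_normalize(cm):
    """Convert counts to percentages so that each true-class row sums to 100."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([100.0 * v / total if total else 0.0 for v in row])
    return normalized


labels = ["Downstairs", "Upstairs"]
# Made-up counts for illustration only (rows: true class, columns: predicted class).
cm = [[976, 24],
      [32, 968]]
rates = row_normalize(cm)

for i, true_label in enumerate(labels):
    for j, predicted_label in enumerate(labels):
        if i != j:
            print(f"{true_label} -> {predicted_label}: {rates[i][j]:.1f}%")
```

The diagonal of the normalized matrix gives the per-class accuracy (recall), and the off-diagonal entries give the share of each true class assigned to every other class.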
A visual comparison of the four individual base models' performance, as well as our proposed Ensem-HAR model, in terms of Precision, Recall, and F1-score is shown in Figure 12. It can be observed from the ROC curve shown in Figure 13 that the area under the ROC curve (AUC) of every class with respect to the others is almost 1, which indicates that our proposed model was able to classify all activity classes efficiently, as seen earlier using the confusion matrix.
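The AUC "of every class with respect to the others" corresponds to a one-vs-rest evaluation. A minimal sketch is given below, using the rank-based (Mann-Whitney) formulation of AUC: the probability that a randomly chosen sample of the class receives a higher score than a randomly chosen sample of any other class. The function name and the example scores are hypothetical and do not come from this study.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC of one class against all others.

    scores: predicted probability of the `positive` class for each sample.
    labels: true class label of each sample.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive sample ranked above negative sample
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for the class 'Sitting' on five samples.
scores = [0.9, 0.4, 0.3, 0.6, 0.5]
labels = ["Sitting", "Sitting", "Standing", "Sitting", "Walking"]
auc = auc_one_vs_rest(scores, labels, "Sitting")
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; library implementations typically sort the scores instead.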

Analysis on PAMAP2 Dataset
The samples in the processed PAMAP2 dataset were split into training (70%) and test (30%) sets, used respectively to train the previously mentioned base models and to test the ensemble of those models. The confusion matrix obtained from the evaluation of our trained model on the test data is shown in Figure 14. According to this confusion matrix, the classification accuracy of our proposed model is around 96% in ten out of twelve activity classes, the exceptions being 'Ascending_Stairs' and 'Descending_Stairs'. The overall accuracy is 97.73%, whereas the base models 'CNN-net', 'CNN-LSTM-net', 'ConvLSTM-net', and 'StackedLSTM-net' have accuracies of 97.01%, 96.91%, 96.88%, and 95.96%, respectively, as shown in Figure 15.
It can be seen from the confusion matrix shown in Figure 14 that, except for these two classes ('Ascending_Stairs' and 'Descending_Stairs'), the activities are well classified by the proposed ensemble of the base models. Around 4% of the samples for each of the 'Ascending_Stairs' and 'Descending_Stairs' activities were misclassified, as they are opposing kinds of activities. Our proposed model also confused the 'Ironing' and 'Standing' activities: around 3% of the samples for 'Standing' were misclassified as 'Ironing', as both cases involve comparable linear acceleration.
The graphical comparison of the four individual models' performance and our proposed Ensem-HAR model with respect to Precision, Recall, and F1-score values for each activity class of the PAMAP2 dataset is shown in Figure 16. It can be observed from the ROC curve shown in Figure 17 that the area under the ROC curve (AUC) of every class with respect to the others is almost 1, which indicates that our proposed model was able to classify all activity classes efficiently, as discussed earlier using the confusion matrix.

Analysis on UCI-HAR Dataset
Out of the total samples present in the UCI-HAR dataset, 7352 samples (training data) were used to train each of the proposed base models and 2947 samples were used as testing data to evaluate the performance of our final ensemble model. The confusion matrix obtained from the evaluation of the trained proposed model on the test data is shown in Figure 18. According to this confusion matrix, the classification accuracy of our proposed model is greater than 94% in four out of six activity classes, the exceptions being the 'Sitting' and 'Standing' classes. The overall accuracy is 95.05%, whereas the base models 'CNN-net', 'CNN-LSTM-net', 'ConvLSTM-net', and 'StackedLSTM-net' have accuracies of 92.64%, 93.52%, 92.53%, and 92.16%, respectively, as shown in Figure 19.

Biosensors 2022, 12, x FOR PEER REVIEW
Figure 18. Confusion matrix achieved from the evaluation of our proposed Ensem-HAR model on the UCI-HAR dataset.
From the confusion matrix (Figure 18), it can be observed that, except for two activity classes, i.e., 'Sitting' and 'Standing', all other activity labels are classified fairly well. Due to the comparable nature of linear acceleration, misclassification occurred between these two classes. In addition, our model misclassified some samples of the 'Walking' activity as 'Walking_Upstairs' and 'Walking_Downstairs', as ascending and descending while walking are similar to normal 'Walking' for aged persons, who do these at a slow rate.
Figure 19. Accuracy (in %) comparison between four base models and Ensem-HAR model on the UCI-HAR dataset.
A visual comparison of the four individual models' performance, as well as our proposed Ensem-HAR model, is shown in Figure 20. It can be observed from the ROC curve shown in Figure 21 that, except for two activity classes, i.e., 'Sitting' and 'Standing', the area under the ROC curve (AUC) is almost 1 for the activity classes, while for these two classes the values are 0.89 and 0.95, respectively. This confirms the observation made from the confusion matrix (Figure 18), i.e., the classes other than these two have higher accuracies.

Statistical Test
In the previous section, we performed a detailed analysis of the performance of our proposed model on three benchmark HAR datasets and found that the proposed ensemble of the four base models outperforms each of them in terms of accuracy. To establish the superiority and effectiveness of our proposed ensemble model over the base models more rigorously, we performed a non-parametric statistical test called the Friedman test [39].
For the Friedman test, we randomly chose ten different subsets from the test data of each considered dataset, each consisting of 50 samples in which all the class labels have equal representation. The classification accuracy of each model was measured on each subset, the models were ranked according to their accuracies, and the mean rank of each model over the ten samples was calculated using the formula

$$R_j = \frac{1}{N} \sum_{i=1}^{N} r_i^j$$

where $r_i^j$ is the rank of the $j$th classifier or model for the $i$th sample. The calculated mean ranks of the classifiers are shown in Table 4. The null hypothesis ($H_0$) states that all the classifiers or models are the same; therefore, their ranks must be equal. To test the null hypothesis, we calculated the value of the Friedman statistic by the following formula [40]:

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]$$

where $k$ is the number of classifiers (here, 5) and $N$ is the number of sample datasets (here, 10). The calculated value of the statistic for the three different HAR datasets used in this experiment is shown in Table 5. It can be seen from the Chi-square table (shown in Figure 22) that at $k - 1$ (here, 4) degrees of freedom (d.o.f.), the critical value of the Friedman statistic at significance level 0.05 is 9.49, which is much less than the calculated values in Table 5. Hence, we can reject the null hypothesis. This experiment indicates that the differences between the results achieved by the base models and the proposed ensemble model are statistically significant, i.e., the models are not equivalent.
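The ranking and test statistic described above can be sketched as follows. This is an illustrative implementation under the standard Friedman formulation with tied ranks averaged, not the code used in this study; the function names and accuracy values are hypothetical, and only the critical value 9.49 (4 degrees of freedom, significance level 0.05) is taken from the text.

```python
def average_ranks(accuracies):
    """Mean rank of each model over all subsets (rank 1 = highest accuracy).

    accuracies[i][j] is the accuracy of model j on subset i; ties share the
    average of the ranks they span.
    """
    n, k = len(accuracies), len(accuracies[0])
    rank_sums = [0.0] * k
    for row in accuracies:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            tied_rank = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                rank_sums[j] += tied_rank
            pos = end + 1
    return [s / n for s in rank_sums]


def friedman_statistic(mean_ranks, n):
    """Friedman chi-square statistic for k models compared on n subsets."""
    k = len(mean_ranks)
    sum_sq = sum(r * r for r in mean_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Hypothetical accuracies of five models on three subsets (not data from this study).
accuracies = [
    [0.90, 0.92, 0.94, 0.96, 0.98],
    [0.88, 0.90, 0.92, 0.94, 0.96],
    [0.86, 0.90, 0.92, 0.94, 0.98],
]
mean_ranks = average_ranks(accuracies)
stat = friedman_statistic(mean_ranks, n=len(accuracies))
CRITICAL_VALUE = 9.49  # chi-square, 4 d.o.f., significance level 0.05 (from the text)
print(mean_ranks, stat, stat > CRITICAL_VALUE)
```

When the statistic exceeds the critical value, the null hypothesis that all models perform equally is rejected.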

Figure 22. Chi-Square Table.
As the classifiers (or models) are not equivalent, we performed a post hoc analysis through the Nemenyi test [40] in order to establish the effectiveness of our proposed model. We calculated the pairwise Nemenyi score, which is the pairwise p-value between two classifiers (or models), and plotted the scores as a heat map (as illustrated in Figure 23). From Figure 23, it can be observed that the pairwise p-values between our proposed Ensem-HAR model and the base models are less than the significance level of 0.05. Hence, it can be concluded that our proposed model is more effective than the four base models, which verifies the effectiveness of the ensemble setup.
For the above-mentioned statistical tests, it should be noted that the ten subsets were drawn at random from the test data. Because of this randomness, the results of the same test may vary: if the tests are performed again with newly selected random subsets, the accuracies of the base models as well as of the final ensemble may change significantly. This can subsequently alter the ranks of the models, and hence the calculated Friedman statistic. As a result, the advantage of the final ensemble model may no longer be statistically significant at the significance level of 0.05. Similarly, either for a new HAR dataset other than the datasets used in this experiment or for any other specific application, it should be kept in mind that the proposed Ensem-HAR model may not perform as well relative to the four base models as it did in this study.
Table 6 shows a comparative study of our proposed Ensem-HAR model with other approaches that have been used for HAR-based problems on the selected benchmark datasets. Based on this comparison, our method outperforms most of the listed approaches. Although some approaches mentioned in Table 6, such as U-Net by Zhang et al. [11], have better accuracy on the UCI-HAR dataset, our model outperforms them on the WISDM dataset. Similarly, the ST-deepHAR proposed by Abdel-Basset et al. [40] has slightly better accuracy on the WISDM dataset than our proposed Ensem-HAR model; however, its computational complexity is high compared to that of our proposed model.
Table 7 shows that our proposed ensemble method (Ensem-HAR) is better than several well-known state-of-the-art ensemble techniques, viz., Max Voting, Average (based on the average of class probabilities of each model), and Weighted Average (based on average of class probabilities multiplied by model's weight) in terms of classification accuracy tested on three HAR datasets.
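For reference, the three baseline combination rules named above can be sketched as follows. This is an illustrative implementation, not the code used for Table 7; the function names, the model weights, and the probability values are assumptions.

```python
from collections import Counter


def argmax(vec):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(vec)), key=lambda c: vec[c])


def max_voting(model_probs):
    """Majority vote over each model's predicted class.

    model_probs[m][i] is the class-probability vector of model m for sample i.
    Ties go to the class that was voted for first.
    """
    preds = []
    for i in range(len(model_probs[0])):
        votes = Counter(argmax(probs[i]) for probs in model_probs)
        preds.append(votes.most_common(1)[0][0])
    return preds


def weighted_average(model_probs, weights=None):
    """Average class probabilities across models; equal weights give the plain Average rule."""
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    preds = []
    for i in range(len(model_probs[0])):
        n_classes = len(model_probs[0][i])
        fused = [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs)) / total
            for c in range(n_classes)
        ]
        preds.append(argmax(fused))
    return preds


# Hypothetical outputs of three models for one sample and two classes.
model_probs = [
    [[0.55, 0.45]],
    [[0.55, 0.45]],
    [[0.05, 0.95]],
]
print(max_voting(model_probs))                   # two of three models vote for class 0
print(weighted_average(model_probs))             # averaging favors class 1
print(weighted_average(model_probs, [5, 5, 1]))  # assumed weights favor class 0
```

The example shows that the rules can disagree on the same predictions. Unlike these fixed rules, the stacking approach used in Ensem-HAR trains a meta-learner on the stacked predictions of the base models.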

Conclusions
In this paper, we have presented Ensem-HAR, an ensemble deep learning model built from four CNN- and LSTM-based models, for smartphone sensor-based HAR problems. We considered three conventional and publicly available datasets; our proposed model performed well on these datasets and predicted activities with good accuracy. Although in certain cases a high correlation between activities leads to misclassification, it was able to outperform several recent methods applied to the mentioned datasets. This work, however, has room for improvement. The models used to construct the ensemble can be selected in such a way that they are distinct from one another, so that each component of the ensemble shows more diversity in its traits, which can make the ensemble more accurate. In addition, it is possible to take different approaches to forming the ensemble of the base models. Furthermore, extra work on time-series data processing can be carried out before the data are fed into a model. Each time series of raw sensor data can be converted into an image or matrix using concepts such as "Gramian Angular Fields (GAF)" and "Markov Transition Fields (MTF)" [43]. Then, transfer learning techniques [44] can be applied to those images, or alternatively the images can be used in models built with CNN and other deep learning architectures [45]. Moreover, it is possible to apply a deep temporal Conv-LSTM architecture [46] to improve the overall performance of HAR by using both temporal features from sensor data and the relationship between sliding windows.

Informed Consent Statement: Not applicable.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article. We have used only publicly available datasets for experimentation.